Note:

  • Upgrade the matplotlib version form 3.2.2 to 3.5.3

    • Use command

      !pip install matplotlib --upgrade
  • Restart the session

1. Defining Problem Statement and Analysing basic metrics¶

  This business case is about one of the most popular media and video streaming platforms. They have over 10000 movies or TV shows available on their platform, as of mid-2021, they have over 222M Subscribers globally. They are planning to launch new shows based on data driven analysis. The business has provided us a tabular dataset consisting of listings of all the movies and TV shows available on the platform, along with details such as - cast, directors, ratings, release year, duration, etc.

  By analysis data related to all the shows on the platform, it is interested in generating insights that could help them in deciding which type of shows/movies to produce and how they can grow the business in different countries. It is mainly interested in knowing the tailor maid strategies to grow their business in different countries, also to understand what type of shows has worked for them.

  This study is based on the assumption that all the shows that are listed in the dataset has good user response and that is why they are still on the platform. It has presence in almost all the countries. However, this study will focus mainly on the countries having most number of shows, since the platform already has good presence in those countries and we have a good amount of data to understand user preference.

  The shows on the platform are of two types: Movies and TV Shows. It is generally observed that the users, those prefer Movies are different that those who prefer TV Shows. Some users prefer shorter format of shows to be watched over series of episodes, while some prefer spending some more time and watching a Movie. The users also has their own favorites when it comes to actors and directors. The directors and actors working in Movies are generally different from TV Shows. Therefore, in this study we will separate the data with respect to show type and analyze each variable to understand user preferences.

We will carry out the following steps:

  • Observe shape of data, data types of all the attributes
  • Data Cleaning, missing value detection, outlier check, statistical summary, make the units uniform
  • Data type conversion
  • Reclassify to get new/ merged categories, conversion of categorical attributes to 'category'
  • Un-nest the nested data variables
  • Non-graphical analysis: Value counts and unique attributes
  • Visual analysis - Univariate, Bivariate analysis using distplot, countplot, histogram, boxplot, heatmaps, and pairplots
  • Understand the patterns and preferences
  • Generate insights and business recommendation

Download the dataset and observe a subset of the data¶

Importing Required Libraries

In [1]:
# !pip install matplotlib==3.5.3
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from matplotlib.ticker import (MultipleLocator, AutoMinorLocator)
import seaborn as sns
import textwrap

Download the DataSet

In [2]:
df = pd.read_csv('/content/Streaming_Data.csv')
df.head()
Out[2]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States September 25, 2021 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN September 24, 2021 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 s4 TV Show Jailbirds New Orleans NaN NaN NaN September 24, 2021 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...

2. Observations on the shape of data, data types of all the attributes, conversion of categorical attributes to 'category' (If required), missing value detection, statistical summary¶

Observe shape of data, data types of all the attributes¶

In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB

Shape:

  • DataSet has 12 columns (attributes)
  • DataSet has 8807 data points

Data Type:

  • Attribute release_year is of int64 Dtype
  • All the remaining attributes are of object Dtype.

Numeric Variable (release_year): Missing value detection, statistical summary¶

In [ ]:
df.describe()
Out[ ]:
release_year
count 8807.000000
mean 2014.180198
std 8.819312
min 1925.000000
25% 2013.000000
50% 2017.000000
75% 2019.000000
max 2021.000000
In [3]:
df['release_year'].isna().sum()
Out[3]:
0

Missing Values:

  • There are no missing Values
  • There are no outliers like 0, -999, etc.

Statistical Summary:

  • Data is from the year 1925 to 2021
  • Data Distribution
    • 50% of the data is of the shows that are released from 2017 till 2021 (~5 yrs)
    • 25% of the data is of the shows that are released from 2013 till 2017 (~4 yrs)
    • 25% of the data is of the shows that are released from 1925 till 2013 (~88 yrs)

Non-Numeric Variables: Data Cleaning, Missing value detection, statistical summary, conversion of categorical attributes to 'category'¶

Data Cleaning:

  • Striping blank spaces at the start and the end
  • Treat blank spaces as missing values
In [ ]:
# Striping the values in Non-Numeric columns
df1 = df.drop('release_year', axis=1).apply(lambda x: x.str.rstrip()).apply(lambda x: x.str.lstrip())
# Assign to NaN
df1 = df1.apply(lambda series_: series_.apply(lambda value_: np.nan if value_=='' else value_))
# Update the DF
df[df1.columns] = df1
del df1
df.head()
Out[ ]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson NaN United States September 25, 2021 2020 PG-13 90 min Documentaries As her father nears the end of his life, filmm...
1 s2 TV Show Blood & Water NaN Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... NaN September 24, 2021 2021 TV-MA 1 Season Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 s4 TV Show Jailbirds New Orleans NaN NaN NaN September 24, 2021 2021 TV-MA 1 Season Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 s5 TV Show Kota Factory NaN Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India September 24, 2021 2021 TV-MA 2 Seasons International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...

Missing Values

In [ ]:
df.isna().sum()
Out[ ]:
show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64

No missing values in columns

  • show_id
  • type
  • title
  • listed_in
  • description

Some columns have missing values

  • dirctor:  $\;$2634
  • country:  $\;\;$831
  • cast:    825
  • date_added: $\;$10
  • rating:    $\;$4
  • duration:   $\;$3

Statistical Summary

In [ ]:
df.describe(include=object)
Out[ ]:
show_id type title director cast country date_added rating duration listed_in description
count 8807 8807 8807 6173 7982 7976 8797 8803 8804 8807 8807
unique 8807 2 8806 4528 7692 748 1714 17 220 514 8775
top s1 Movie Consequences Rajiv Chilaka David Attenborough United States January 1, 2020 TV-MA 1 Season Dramas, International Movies Paranormal activity at a lush, abandoned prope...
freq 1 6131 2 19 19 2818 110 3207 1793 362 4

show_id:

  • All the values are unique

type:

  • It has two categories

title:

  • There is a duplicate title: 'Consequences'

director:

  • There are 4528 unique combination of directors
  • Rajiv Chilaka alone has directed the most (19) shows

cast:

  • There are 7692 unique combination of cast
  • David Attenborough as a single cast has acted in the most (19) shows

country:

  • Shows were produced in 748 different combinations of countries
  • United States alone has produced the most (2818) shows

date_added:

  • Out of 8797 available dates, there are 1714 unique date_added
  • On January 1, 2020, most (110) shows were added

rating:

  • There are 17 unique categories of rating
  • Most (3207) shows have rating: TV-MA

duration:

  • Out of 8797 available durations, there are 220 unique durations
  • Season 1 is the most repeated (1793) duration

listed_in:

  • Out of 8807 available genre combinations, there are 514 unique genre combinations
  • Most of the shows (362) fall under 'Dramas, International Movies' Genre

description:

- Most of the shows have unique descriptions¶

Conversion of Categorical Attributes to 'category'

  • Convert the show type column Dtype to category as it has two categoreis: Movie and TV Show
  • Combine the ratings that have same meaning and create new ratings for the analysis. Then convert the rating Dtype to category
In [ ]:
# Convert show type to category Dtype
df['type'] = df['type'].astype('category')

# Combine ratings having same meaning
df.loc[df['rating'].isin(['TV-MA', 'R', 'NC-17']), 'rating'] = '17+'
df.loc[df['rating'].isin(['G', 'TV-G']), 'rating'] = 'General'
df.loc[df['rating'].isin(['NR', 'UR']), 'rating'] = 'Not Rated'
df.loc[df['rating']=='TV-Y', 'rating'] = 'All Kids'
df.loc[df['rating'].isin(['TV-PG', 'PG']), 'rating'] = 'Parental Guidance'
df.loc[df['rating']=='PG-13', 'rating'] = '13+'
df.loc[df['rating']=='TV-14', 'rating'] = '14+'
df.loc[df['rating'].isin(['TV-Y7', 'TV-Y7-FV']), 'rating'] = '7+'

# Convert rating column Dtype to category
df['rating'] = df['rating'].astype('category')

5. Missing Value & Outlier check (Treatment optional)¶

For Variable: director, cast and country¶

To find the best guess for the missing values, I tried to creat diffrent groups and observed for any pattern but I did not find any pattern.

Assign missing director, cast and country values as Unknown

In [ ]:
df[['director', 'cast', 'country']] = df[['director', 'cast', 'country']].fillna('Unknown')

For Variable: date_added¶

Convert the date_added column Dtype to datetime

To find the best guess for the missing date_added, I tried to creat diffrent groups and observed if ther is any pattern/ correlation with release date, type, directory, cast and country; but I could not find any.

Since missing date_added values are less in numbers (10). I would keep it as NaN.

Check if the date_added is before the release_year of any show. If it is so, it is a wrong information. We are not sure if the release_year or date_added is correct. Therefore for such records we make those values NaN.

In [ ]:
df['date_added'] = pd.to_datetime(df['date_added'])
In [ ]:
df.loc[df['date_added'].dt.year < df['release_year'], ['date_added', 'release_year']] = np.nan

For Variable: duration¶

In [ ]:
df.loc[df['duration'].isna()]
Out[ ]:
show_id type title director cast country date_added release_year rating duration listed_in description
5541 s5542 Movie Louis C.K. 2017 Louis C.K. Louis C.K. United States 2017-04-04 2017.0 74 min NaN Movies Louis C.K. muses on religion, eternal love, gi...
5794 s5795 Movie Louis C.K.: Hilarious Louis C.K. Louis C.K. United States 2016-09-16 2010.0 84 min NaN Movies Emmy-winning comedy writer Louis C.K. brings h...
5813 s5814 Movie Louis C.K.: Live at the Comedy Store Louis C.K. Louis C.K. United States 2016-08-15 2015.0 66 min NaN Movies The comic puts his trademark hilarious/thought...

Duration values of the shows are shifted to rating columns. Replace the duration from rating column to duration column


In [ ]:
df.loc[df['duration'].isna(), ['rating', 'duration']] = df.loc[df['duration'].isna(), ['duration', 'rating']].values

Recalculate rating categories

In [ ]:
df['rating'] = df['rating'].astype('string')
df['rating'] = df['rating'].astype('category')

Make the duration units uniform

  • Check different units available for type: Movie and TV Show
  • Remove the units and convert the Dtype to Numeric

For type: Movie

check for different units used

In [ ]:
df.loc[df['type']=='Movie', 'duration'].str.split(' ').apply(lambda x: '-'.join([i for i in x if i.isalpha()])).unique()
Out[ ]:
array(['min'], dtype=object)

All the durations of type Movie are in min


Check if all the values in the durtion column contains only one numeric value

In [ ]:
df.loc[df['type']=='Movie', 'duration'].str.split(' ').apply(lambda x: len([i for i in x if i.isnumeric()])==1).all()
Out[ ]:
True

Keep only the numeric values

In [ ]:
df.loc[df['type']=='Movie', 'duration'] = df.loc[df['type']=='Movie', 'duration'].str.split(' ').apply(lambda x: [i for i in x if i.isnumeric()][0])

For type: TV Show

check for different units used

In [ ]:
df.loc[df['type']=='TV Show', 'duration'].str.split(' ').apply(lambda x: '-'.join([i for i in x if i.isalpha()])).unique()
Out[ ]:
array(['Seasons', 'Season'], dtype=object)

All the durations of type TV Show are in either Season for 1 season and Seasons for multiple seasons


Check if all the values in the durtion column contains only one numeric value

In [ ]:
df.loc[df['type']=='TV Show', 'duration'].str.split(' ').apply(lambda x: len([i for i in x if i.isnumeric()])==1).all()
Out[ ]:
True

Keep only the numeric values

In [ ]:
df.loc[df['type']=='TV Show', 'duration'] = df.loc[df['type']=='TV Show', 'duration'].str.split(' ').apply(lambda x: [i for i in x if i.isnumeric()][0])

Convert the Dtype to integet

In [ ]:
df['duration'] = df['duration'].astype('int')

For Variable: rating¶

Assign Missing Rating values to Not Rated Category

In [ ]:
df.loc[df['rating'].isna(), 'rating'] = 'Not Rated'

Check for missing values¶

In [ ]:
df.isna().sum()
Out[ ]:
show_id          0
type             0
title            0
director         0
cast             0
country          0
date_added      24
release_year    14
rating           0
duration         0
listed_in        0
description      0
dtype: int64

As discussed, all the missing values are treated. As discussed there are some missing values in date_added and release_year columns.


Check for outlier values¶

In [ ]:
df.groupby('type').describe()
Out[ ]:
release_year duration
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
type
Movie 6129.0 2013.119759 9.679256 1942.0 2012.0 2016.0 2018.0 2021.0 6131.0 99.564998 28.289504 3.0 87.0 98.0 114.0 312.0
TV Show 2664.0 2016.593468 5.749193 1925.0 2016.0 2018.0 2020.0 2021.0 2676.0 1.764948 1.582752 1.0 1.0 1.0 2.0 17.0
In [ ]:
df.groupby('type').describe(include=np.datetime64, datetime_is_numeric=True)
Out[ ]:
date_added
count mean min 25% 50% 75% max
type
Movie 6129 2019-05-07 06:56:47.929515264 2008-01-01 2018-04-01 00:00:00 2019-06-20 00:00:00 2020-07-24 2021-09-25
TV Show 2654 2019-06-10 13:43:05.380557568 2008-02-04 2018-04-21 18:00:00 2019-08-15 12:00:00 2020-10-01 2021-09-24
In [ ]:
df.groupby('type').describe(include='category')
Out[ ]:
rating
count unique top freq
type
Movie 6131 8 17+ 2862
TV Show 2676 7 17+ 1147
In [ ]:
df.loc[df['director']!='Unknown', 'director'].describe(include=object)
Out[ ]:
count              6173
unique             4528
top       Rajiv Chilaka
freq                 19
Name: director, dtype: object
In [ ]:
df.loc[df['cast']!='Unknown', 'cast'].describe(include=object)
Out[ ]:
count                   7982
unique                  7692
top       David Attenborough
freq                      19
Name: cast, dtype: object
In [ ]:
df.loc[df['country']!='Unknown', 'country'].describe(include=object)
Out[ ]:
count              7976
unique              748
top       United States
freq               2818
Name: country, dtype: object
In [ ]:
df.loc[df['listed_in']!='Unknown', 'listed_in'].describe(include=object)
Out[ ]:
count                             8807
unique                             514
top       Dramas, International Movies
freq                               362
Name: listed_in, dtype: object

Check Movies with title 'Consequences' as there are two movies with same title


In [ ]:
df.loc[df['title']=='Consequences']
Out[ ]:
show_id type title director cast country date_added release_year rating duration listed_in description
3371 s3372 Movie Consequences Ozan Açıktan Nehir Erdoğan, Tardu Flordun, İlker Kaleli, Se... Turkey 2019-10-25 2014.0 17+ 106 Dramas, International Movies, Thrillers Secrets bubble to the surface after a sensual ...
6529 s6530 Movie Consequences Ozan Açıktan Nehir Erdoğan, Tardu Flordun, İlker Kaleli, Se... Turkey 2019-10-25 2014.0 17+ 106 Dramas, International Movies, Thrillers Secrets bubble to the surface after a sensual ...

Remove the duplicate movie record

In [ ]:
df = df.drop_duplicates(subset='title')

3. Non-Graphical Analysis: Value counts and unique attributes¶

Types of show (Movie/ TV Show); Unique Counts¶

In [ ]:
df['type'].value_counts()
Out[ ]:
Movie      6130
TV Show    2676
Name: type, dtype: int64
In [ ]:
(df['type'].value_counts()/ df.shape[0] * 100).round(2)
Out[ ]:
Movie      69.61
TV Show    30.39
Name: type, dtype: float64
In [ ]:
df.groupby('type')[['show_id', 'title']].nunique()
Out[ ]:
show_id title
type
Movie 6130 6130
TV Show 2676 2676

Columns show_id and title are unique for each show

There are 70% Movies (6130) and 30% TV Shows (2676)


Value Counts of Combination of Directors by Show Type¶

In [ ]:
df.loc[df['type']=='Movie', 'director'].value_counts()
Out[ ]:
Unknown                       188
Rajiv Chilaka                  19
Raúl Campos, Jan Suter         18
Suhas Kadav                    16
Marcus Raboy                   15
                             ... 
Sion Sono                       1
Dennis Rovira van Boekholt      1
Naoto Amazutsumi                1
Jenny Gage                      1
Mozez Singh                     1
Name: director, Length: 4355, dtype: int64
In [ ]:
df.loc[df['type']=='TV Show', 'director'].value_counts()
Out[ ]:
Unknown                              2446
Alastair Fothergill                     3
Hsu Fu-chun                             2
Iginio Straffi                          2
Shin Won-ho                             2
                                     ... 
Juliana Vicente                         1
Chang Chin-jung, Chen Rong-hui          1
Thierry Demaizière, Alban Teurlai       1
Manolo Caro                             1
Michael Cumming                         1
Name: director, Length: 223, dtype: int64

Value Counts of Combination of Cast by Show Type¶

In [ ]:
df.loc[df['type']=='Movie', 'cast'].value_counts()
Out[ ]:
Unknown                                                                                                                                                                               475
Vatsal Dubey, Julie Tejwani, Rupa Bhimani, Jigna Bhardwaj, Rajesh Kava, Mousam, Swapnil                                                                                                13
Samuel West                                                                                                                                                                            10
Jeff Dunham                                                                                                                                                                             7
Kevin Hart                                                                                                                                                                              6
                                                                                                                                                                                     ... 
Hánssel Casillas, Loreto Peralta, Andrea Sutton, Luis De La Rosa, Alejandro Flores, Anajosé Aldrete, Daniel Haddad, Montserrat Marañón, Juan Carlos Viana Prieto, Hernán Del Riego      1
Oscar Martínez, Inma Cuesta, Mafalda Carbonell, Nacho López, Isabel Requena, Aina Clotet, Antonio Valero, María Zamora, Hugo Balaguer, Valeria Schoneveld                               1
Casey Affleck, Rooney Mara, Liz Franke, Rob Zabrecky, Will Oldham, Sonia Acevedo                                                                                                        1
Faizon Love, Kirk Fox, Tony Todd, Pearl Thusi, Cassie Clare, Fiona Ramsey, Neels Clasen, Roxy Nel                                                                                       1
Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanana, Manish Chaudhary, Meghna Malik, Malkeet Rauni, Anita Shabdish, Chittaranjan Tripathy                                                   1
Name: cast, Length: 5446, dtype: int64
In [ ]:
df.loc[df['type']=='TV Show', 'cast'].value_counts()
Out[ ]:
Unknown                                                                                                                                                             350
David Attenborough                                                                                                                                                   14
Michela Luci, Jamie Watson, Anna Claire Bartlam, Dante Zee, Eric Peterson                                                                                             4
Dave Chappelle                                                                                                                                                        3
You, Reina Triendl, Yoshimi Tokui, Azusa Babazono, Ryota Yamasato, Shono Hayama                                                                                       2
                                                                                                                                                                   ... 
Vir Das, Ranvir Shorey, Manoj Pahwa, Ravi Kishan, Raza Murad, Inaamulhaq, Suhail Nayyar, Amrita Bagchi                                                                1
Sebastián Rulli, Renata Notni, Roberto Mateos, Irina Baeva, Cassandra Sánchez Navarro, Manuel Balbi, Javier Gómez, Alejandro Ávila, Sofía Castro, Juan Pablo Gil      1
Kenya Barris, Rashida Jones, Iman Benson, Genneya Walton, Scarlet Spencer, Justin Claiborne, Ravi Cabot-Conyers                                                       1
Lee Min-ho, Kim Go-eun, Woo Do‑hwan, Kim Kyung-nam, Jung Eun-chae, Lee Jung-jin, Kim Young-ok                                                                         1
Sanam Saeed, Fawad Khan, Ayesha Omer, Mehreen Raheel, Sheheryar Munawar, Samina Peerzada, Waseem Abbas, Javed Sheikh, Hina Khawaja Bayat                              1
Name: cast, Length: 2284, dtype: int64

Value Counts of Combination of Country by Show Type¶

In [ ]:
df.loc[df['type']=='Movie', 'country'].value_counts()
Out[ ]:
United States                            2058
India                                     893
Unknown                                   440
United Kingdom                            206
Canada                                    122
                                         ... 
United Kingdom, Russia, United States       1
Paraguay, Argentina                         1
United Kingdom, Malawi                      1
Austria, Iraq, United States                1
United Arab Emirates, Jordan                1
Name: country, Length: 652, dtype: int64
In [ ]:
df.loc[df['type']=='TV Show', 'country'].value_counts()
Out[ ]:
United States                                    760
Unknown                                          391
United Kingdom                                   213
Japan                                            169
South Korea                                      158
                                                ... 
Belarus                                            1
United Kingdom, Australia                          1
France, Australia, Germany                         1
Australia, New Zealand, United States              1
United States, France, South Korea, Indonesia      1
Name: country, Length: 197, dtype: int64

Value Counts of Ratings by Show Type¶

In [ ]:
df.loc[df['type']=='Movie', 'rating'].value_counts()
Out[ ]:
17+                  2861
14+                  1427
Parental Guidance     827
13+                   490
General               167
7+                    144
All Kids              131
Not Rated              83
Name: rating, dtype: int64
In [ ]:
df.loc[df['type']=='TV Show', 'rating'].value_counts()
Out[ ]:
17+                  1147
14+                   733
Parental Guidance     323
7+                    196
All Kids              176
General                94
Not Rated               7
13+                     0
Name: rating, dtype: int64

Value Counts of Combinations of Gener by Show Type¶

In [ ]:
df.loc[df['type']=='Movie', 'listed_in'].value_counts()
Out[ ]:
Dramas, International Movies                        362
Documentaries                                       359
Stand-Up Comedy                                     334
Comedies, Dramas, International Movies              274
Dramas, Independent Movies, International Movies    252
                                                   ... 
Sci-Fi & Fantasy                                      1
Sports Movies                                         1
Children & Family Movies, Comedies, Cult Movies       1
Cult Movies, Dramas, Music & Musicals                 1
Cult Movies, Dramas, Thrillers                        1
Name: listed_in, Length: 278, dtype: int64
In [ ]:
df.loc[df['type']=='TV Show', 'listed_in'].value_counts()
Out[ ]:
Kids' TV                                                 220
International TV Shows, TV Dramas                        121
Crime TV Shows, International TV Shows, TV Dramas        110
Kids' TV, TV Comedies                                     99
Reality TV                                                95
                                                        ... 
Kids' TV, TV Action & Adventure, TV Dramas                 1
British TV Shows, Kids' TV, TV Thrillers                   1
Reality TV, TV Horror, TV Thrillers                        1
TV Action & Adventure, TV Horror, TV Sci-Fi & Fantasy      1
Classic & Cult TV, Crime TV Shows, TV Dramas               1
Name: listed_in, Length: 236, dtype: int64

Distribution of Date Added by Show Type¶

In [ ]:
df.loc[df['type']=='Movie', 'date_added'].describe(datetime_is_numeric=True)
Out[ ]:
count                             6128
mean     2019-05-07 06:16:41.044386560
min                2008-01-01 00:00:00
25%                2018-04-01 00:00:00
50%                2019-06-19 12:00:00
75%                2020-07-24 00:00:00
max                2021-09-25 00:00:00
Name: date_added, dtype: object
In [ ]:
df.loc[df['type']=='TV Show', 'date_added'].describe(datetime_is_numeric=True)
Out[ ]:
count                             2654
mean     2019-06-10 13:43:05.380557568
min                2008-02-04 00:00:00
25%                2018-04-21 18:00:00
50%                2019-08-15 12:00:00
75%                2020-10-01 00:00:00
max                2021-09-24 00:00:00
Name: date_added, dtype: object

Distribution of Release Year by Show Type¶

In [ ]:
df.loc[df['type']=='Movie', 'release_year'].describe().round()
Out[ ]:
count    6128.0
mean     2013.0
std        10.0
min      1942.0
25%      2012.0
50%      2016.0
75%      2018.0
max      2021.0
Name: release_year, dtype: float64
In [ ]:
df.loc[df['type']=='TV Show', 'release_year'].describe().round()
Out[ ]:
count    2664.0
mean     2017.0
std         6.0
min      1925.0
25%      2016.0
50%      2018.0
75%      2020.0
max      2021.0
Name: release_year, dtype: float64

Distribution of Duration by Show Type¶

In [ ]:
df.loc[df['type']=='Movie', 'duration'].describe().round()
Out[ ]:
count    6130.0
mean      100.0
std        28.0
min         3.0
25%        87.0
50%        98.0
75%       114.0
max       312.0
Name: duration, dtype: float64
In [ ]:
df.loc[df['type']=='TV Show', 'duration'].describe().round()
Out[ ]:
count    2676.0
mean        2.0
std         2.0
min         1.0
25%         1.0
50%         1.0
75%         2.0
max        17.0
Name: duration, dtype: float64

4. Visual Analysis - Univariate, Bivariate after pre-processing of the data: Preprocessing¶

Define a Function to Split Nested Data Column into Rows

In [ ]:
df.head()
Out[ ]:
show_id type title director cast country date_added release_year rating duration listed_in description
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson Unknown United States 2021-09-25 2020.0 13+ 90 Documentaries As her father nears the end of his life, filmm...
1 s2 TV Show Blood & Water Unknown Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... South Africa 2021-09-24 2021.0 17+ 2 International TV Shows, TV Dramas, TV Mysteries After crossing paths at a party, a Cape Town t...
2 s3 TV Show Ganglands Julien Leclercq Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... Unknown 2021-09-24 2021.0 17+ 1 Crime TV Shows, International TV Shows, TV Act... To protect his family from a powerful drug lor...
3 s4 TV Show Jailbirds New Orleans Unknown Unknown Unknown 2021-09-24 2021.0 17+ 1 Docuseries, Reality TV Feuds, flirtations and toilet talk go down amo...
4 s5 TV Show Kota Factory Unknown Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... India 2021-09-24 2021.0 17+ 2 International TV Shows, Romantic TV Shows, TV ... In a city of coaching centers known to train I...
In [ ]:
def unnested_column(df, id_col, nested_col):
  df1 = pd.concat((df[id_col], df[nested_col].str.split(',')), axis=1).explode(nested_col)
  ''' Alternate approach
  df1 = pd.concat((df[id_col], df[nested_col].str.split(',', expand=True)), axis=1)
  df1 = pd.melt(df1.reset_index(), id_vars=['index', id_col], value_name=nested_col).dropna().sort_values('index').drop(['index', 'variable'], axis=1)
  '''
  df1[nested_col] = df1[nested_col].str.lstrip().str.rstrip()
  df = df.drop(nested_col, axis=1)
  return [df, df1]

Unnest director, cast, country and listed_in Columns

In [ ]:
# df, df_listed_in = unnested_column(df, 'show_id', 'listed_in')
# df, df_director = unnested_column(df, 'show_id', 'director')
# df, df_cast = unnested_column(df, 'show_id', 'cast')
# df, df_country = unnested_column(df, 'show_id', 'country')

Merge the unnested columns with the main df

In [ ]:
# df = pd.merge(df, df_director, on='show_id')
# df = pd.merge(df, df_cast, on='show_id')
# df = pd.merge(df, df_country, on='show_id')
# df = pd.merge(df, df_listed_in, on='show_id')
# del df_director, df_cast, df_country, df_listed_in

Alternate and efficient approach to get unnested columns and then melt is to use split explode

In [ ]:
def unnest_column_approach01(df_, id_col, nested_col):
  df_ = pd.concat((df_.drop(nested_col, axis=1), df_[nested_col].str.split(',')), axis=1).explode(nested_col)
  df_[nested_col] = df_[nested_col].str.lstrip().str.rstrip()
  return df_.reset_index(drop=True)
In [ ]:
df = unnest_column_approach01(df, 'show_id', 'listed_in')
df = unnest_column_approach01(df, 'show_id', 'director')
df = unnest_column_approach01(df, 'show_id', 'cast')
df = unnest_column_approach01(df, 'show_id', 'country')

Rearrange the Columns

In [ ]:
df.head()
Out[ ]:
show_id type title date_added release_year rating duration description listed_in director cast country
0 s1 Movie Dick Johnson Is Dead 2021-09-25 2020.0 13+ 90 As her father nears the end of his life, filmm... Documentaries Kirsten Johnson Unknown United States
1 s2 TV Show Blood & Water 2021-09-24 2021.0 17+ 2 After crossing paths at a party, a Cape Town t... International TV Shows Unknown Ama Qamata South Africa
2 s2 TV Show Blood & Water 2021-09-24 2021.0 17+ 2 After crossing paths at a party, a Cape Town t... International TV Shows Unknown Khosi Ngema South Africa
3 s2 TV Show Blood & Water 2021-09-24 2021.0 17+ 2 After crossing paths at a party, a Cape Town t... International TV Shows Unknown Gail Mabalane South Africa
4 s2 TV Show Blood & Water 2021-09-24 2021.0 17+ 2 After crossing paths at a party, a Cape Town t... International TV Shows Unknown Thabang Molaba South Africa
In [ ]:
df = df[['show_id', 'type', 'title', 'director', 'cast', 'country', 'listed_in', 'duration', 'rating', 'date_added', 'release_year', 'description']]
In [ ]:
df.head(10)
Out[ ]:
show_id type title director cast country listed_in duration rating date_added release_year description
0 s1 Movie Dick Johnson Is Dead Kirsten Johnson Unknown United States Documentaries 90 13+ 2021-09-25 2020.0 As her father nears the end of his life, filmm...
1 s2 TV Show Blood & Water Unknown Ama Qamata South Africa International TV Shows 2 17+ 2021-09-24 2021.0 After crossing paths at a party, a Cape Town t...
2 s2 TV Show Blood & Water Unknown Khosi Ngema South Africa International TV Shows 2 17+ 2021-09-24 2021.0 After crossing paths at a party, a Cape Town t...
3 s2 TV Show Blood & Water Unknown Gail Mabalane South Africa International TV Shows 2 17+ 2021-09-24 2021.0 After crossing paths at a party, a Cape Town t...
4 s2 TV Show Blood & Water Unknown Thabang Molaba South Africa International TV Shows 2 17+ 2021-09-24 2021.0 After crossing paths at a party, a Cape Town t...
5 s2 TV Show Blood & Water Unknown Dillon Windvogel South Africa International TV Shows 2 17+ 2021-09-24 2021.0 After crossing paths at a party, a Cape Town t...
6 s2 TV Show Blood & Water Unknown Natasha Thahane South Africa International TV Shows 2 17+ 2021-09-24 2021.0 After crossing paths at a party, a Cape Town t...
7 s2 TV Show Blood & Water Unknown Arno Greeff South Africa International TV Shows 2 17+ 2021-09-24 2021.0 After crossing paths at a party, a Cape Town t...
8 s2 TV Show Blood & Water Unknown Xolile Tshabalala South Africa International TV Shows 2 17+ 2021-09-24 2021.0 After crossing paths at a party, a Cape Town t...
9 s2 TV Show Blood & Water Unknown Getmore Sithole South Africa International TV Shows 2 17+ 2021-09-24 2021.0 After crossing paths at a party, a Cape Town t...

4.1 For continuous variable(s): Distplot, countplot, histogram for univariate analysis¶

Univariate Analysis¶

Define a function to Dist Plot and Count Plot by Type of Show

In [ ]:
def plt_dist_plot(ax_, df_, var_, title_, xlabel_, ylabel_):
  df_sub_count = df_[['type', var_]].value_counts()
  df_sub_count.name = xlabel_
  df_sub_count = df_sub_count.reset_index()
  sns.kdeplot(data=df_sub_count, x=xlabel_, ax=ax_,
                   hue='type', hue_order=['TV Show', 'Movie'], lw=4, palette=['#3A5BA0', '#DC3535'])
  plt.sca(ax_)
  plt.title(title_, size=16, color='grey')
  plt.xlabel(xlabel_, size=12)
  plt.ylabel(ylabel_, size=12)
  plt.xticks(size=12)
  plt.yticks(size=12)
  plt.legend(labels=['Movie', 'TV Show'], loc='best', fontsize=12)

  sns.despine(right=True, top=True, ax=ax_)
  ax_.spines['left'].set_color('grey')
  ax_.spines['bottom'].set_color('grey')

  ax_.figure.set_size_inches(16,8)
  ax_.figure.subplots_adjust(top=0.81,right=0.86) 

  return plt
In [ ]:
def plt_most10_count_plot(ax_, df_, var_, show_type, title_, independant_label_, dependant_label_, unit_='', highlight_num=3, highlight_color='#DC3535', orientation_='v'):
  list_most10 = df_.loc[df_['type']==show_type, var_].value_counts().iloc[:10].index
  df_ = df_.loc[(df_['type']==show_type) & df_[var_].isin(list_most10), [var_]]

  custom_palette = [highlight_color for i in range(highlight_num)] + ['grey' for i in range(10-highlight_num)]
  if orientation_ == 'v':
    sns.countplot(data=df_, x=var_, order=list_most10, ax=ax_,
                      palette = custom_palette)
    
    plt.sca(ax_)
    plt.xlabel(independant_label_)
    plt.ylabel(dependant_label_)

    plt.tick_params(
      axis='y',          # changes apply to the y-axis
      which='both',      # both major and minor ticks are affected
      left=False,      # ticks along the left edge are off
      right=False,         # ticks along the right edge are off
      labelleft=False) # labels along the left edge are off

    labels = []
    for label in ax_.get_xticklabels():
        text = label.get_text()
        labels.append(textwrap.fill(text, width=10,
                      break_long_words=False))
    ax_.set_xticklabels(labels, rotation=0)

    bar_labels=[]
    for container in ax_.containers:
      for rect in container:
        # Rectangle widths are already integer-valued but are floating
        # type, so it helps to remove the trailing decimal point and 0 by
        # converting width to int type

        # Shift the text to the left side of the right edge
        yloc = 4
        # White on magenta
        clr = 'white'
        align = 'bottom'
        rotation_ = 90

        label_text = '' if not rect.get_height() else round(rect.get_height()) if rect.get_height() > 0 and rect.get_height() < 1000 else f'{rect.get_height()/1000: .1f} k'
        label_text = f'{label_text} {unit_}' if label_text and rect.get_height()/container[0].get_height() > 0.05+0.022*len(unit_) else label_text if rect.get_height()/container[0].get_height() > 0.05 else ''

        size_ = 12 if rect.get_height()/container[0].get_height() > 0.1 else 9 if rect.get_height()/container[0].get_height() > 0.06 else 8 if rect.get_height()/container[0].get_height() > 0.055 else 7

        # Center the text vertically in the bar
        xloc = rect.get_x() + rect.get_width() / 2
        ax_.annotate(label_text, xy=(xloc, 0), xytext=(0, yloc),
                            textcoords="offset points", size=size_,
                            ha='center', va=align, rotation=rotation_,
                            color=clr, clip_on=True)

  else:
    sns.countplot(data=df_, y=var_, order=list_most10, ax=ax_,
                      palette = custom_palette)
    plt.sca(ax_)
    plt.xlabel(dependant_label_)
    plt.ylabel(independant_label_)

    plt.tick_params(
      axis='x',          # changes apply to the x-axis
      which='both',      # both major and minor ticks are affected
      bottom=False,      # ticks along the bottom edge are off
      top=False,         # ticks along the top edge are off
      labelbottom=False) # labels along the bottom edge are off

    labels = []
    for label in ax_.get_yticklabels():
        text = label.get_text()
        labels.append(textwrap.fill(text, width=15,
                      break_long_words=False))
    ax_.set_yticklabels(labels, rotation=0)

    bar_labels=[]
    for container in ax_.containers:
      for rect in container:
        # Rectangle widths are already integer-valued but are floating
        # type, so it helps to remove the trailing decimal point and 0 by
        # converting width to int type

        # Shift the text to the left side of the right edge
        xloc = 2
        # White on magenta
        clr = 'white'
        align = 'left'

        label_text = '' if not rect.get_width() else round(rect.get_width()) if rect.get_width() > 0 and rect.get_width() < 1000 else f'{rect.get_width()/1000: .1f} k'
        label_text = f'{label_text} {unit_}' if label_text and rect.get_width()/container[0].get_width() > 0.05+0.022*len(unit_) else label_text if rect.get_width()/container[0].get_width() > 0.05 else ''

        size_ = 12 if rect.get_width()/container[0].get_width() > 0.1 else 9 if rect.get_width()/container[0].get_width() > 0.06 else 8 if rect.get_width()/container[0].get_width() > 0.055 else 7

        # Center the text vertically in the bar
        yloc = rect.get_y() + rect.get_height() / 2
        ax_.annotate(label_text, xy=(0, yloc), xytext=(xloc, 0),
                            textcoords="offset points", size=size_,
                            ha=align, va='center',
                            color=clr, clip_on=True)

  sns.despine(left=True, bottom=True, ax=ax_)

  plt.title(title_, size=16, color='grey')
  plt.xticks(size=12)
  plt.yticks(size=12)
  
  ax_.spines['left'].set_color('grey')
  ax_.spines['bottom'].set_color('grey')

  ax_.figure.set_size_inches(16,8)
  ax_.figure.subplots_adjust(top=0.81,right=0.86) 

  return plt

Director by Type of Show

In [ ]:
df_sub = df[['show_id', 'type', 'director']].drop(df.loc[df['director']=='Unknown'].index).drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                               ['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']

plt_dist_plot(ax_=ax1, df_=df_sub, var_='director', title_='Shows per Director Count', xlabel_='Shows per Director', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='director', show_type='Movie', title_='Most Watched Directors', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='director', show_type='TV Show', title_='Most Watched Directors', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=2, highlight_color='#3A5BA0', orientation_='h')

fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)

plt.show()

Cast by Type of Show

In [ ]:
df_sub = df[['show_id', 'type', 'cast']].drop(df.loc[df['cast']=='Unknown'].index).drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                               ['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']

plt_dist_plot(ax_=ax1, df_=df_sub, var_='cast', title_='Shows per Actor Count', xlabel_='Shows per Actor', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='cast', show_type='Movie', title_='Most Watched Actors', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=5, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='cast', show_type='TV Show', title_='Most Watched Actors', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=5, highlight_color='#3A5BA0', orientation_='h')

fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)

plt.show()

Actor & Director by Type of Show

In [ ]:
df_sub = df[['show_id', 'type', 'director', 'cast']].drop(df.loc[(df['director']=='Unknown')|(df['cast']=='Unknown')].index).drop_duplicates()
df_sub['actor_director'] = df['cast'] + ' & ' + df['director']
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                               ['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']

plt_dist_plot(ax_=ax1, df_=df_sub, var_='actor_director', title_='Shows per Actor & Director Count', xlabel_='Shows per Actor & Director', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='actor_director', show_type='Movie', title_='Most Watched Actor & Director', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=5, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='actor_director', show_type='TV Show', title_='Most Watched Actor & Director', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=7, highlight_color='#3A5BA0', orientation_='h')

fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)

plt.show()

Genre by Type of Show

In [ ]:
df_sub = df[['show_id', 'type', 'listed_in']].drop(df.loc[df['listed_in']=='Unknown'].index).drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                               ['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']

plt_dist_plot(ax_=ax1, df_=df_sub, var_='listed_in', title_='Shows per Genre Count', xlabel_='Shows per Genre', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='listed_in', show_type='Movie', title_='Most Watched Genre', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='listed_in', show_type='TV Show', title_='Most Watched Genre', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)

plt.show()

Rating by Type of Show

In [ ]:
df_sub = df[['show_id', 'type', 'rating']].drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                               ['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']

plt_dist_plot(ax_=ax1, df_=df_sub, var_='rating', title_='Shows per Rating Count', xlabel_='Shows per Rating', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='rating', show_type='Movie', title_='Most Watched Rating', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='rating', show_type='TV Show', title_='Most Watched Rating', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)

plt.show()

Duration by Type of Show

In [ ]:
df_sub = df[['show_id', 'type', 'duration']].drop_duplicates()
df_sub.loc[df['type']=='Movie', 'duration'] = np.ceil(df_sub.loc[df['type']=='Movie', 'duration']/10)*10
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                               ['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']

plt_dist_plot(ax_=ax1, df_=df_sub, var_='duration', title_='Shows per Duration', xlabel_='Shows per Duration', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='duration', show_type='Movie', title_='Most Shows Duration', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='duration', show_type='TV Show', title_='Most Shows Duration', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)

plt.show()

Date Added by Type of Show

In [ ]:
df_sub = df[['show_id', 'type', 'date_added']].drop_duplicates()
df_sub['date_added'] = df_sub['date_added'].dt.year
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                               ['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']

plt_dist_plot(ax_=ax1, df_=df_sub, var_='date_added', title_='Shows per Date Added Count (Year)', xlabel_='Shows per Date Added (Year) ', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='date_added', show_type='Movie', title_='Most Shows Date Added ', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='date_added', show_type='TV Show', title_='Most Shows Date Added ', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)

plt.show()

Release Year by Type of Show

In [ ]:
df_sub = df[['show_id', 'type', 'release_year']].drop_duplicates()
# df_sub['date_added'] = df_sub['release_year'].dt.year
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                               ['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']

plt_dist_plot(ax_=ax1, df_=df_sub, var_='release_year', title_='Shows per Release Year', xlabel_='Shows per Release Year', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='release_year', show_type='Movie', title_='Most Shows Release Year', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=4, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='release_year', show_type='TV Show', title_='Most Shows Release Year', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=4, highlight_color='#3A5BA0', orientation_='h')

fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)

plt.show()

Country by Type of Show

In [ ]:
df_sub = df[['show_id', 'type', 'country']].drop(df.loc[df['country']=='Unknown'].index).drop_duplicates()
fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                               ['lower left', 'lower right']])
ax1 = axd['upper']
ax2 = axd['lower left']
ax3 = axd['lower right']

plt_dist_plot(ax_=ax1, df_=df_sub, var_='country', title_='Shows per Country Count', xlabel_='Shows per Country', ylabel_='Density')
plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='country', show_type='Movie', title_='Most Shows per Country', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=5, highlight_color='#DC3535', orientation_='h')
plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='country', show_type='TV Show', title_='Most Shows per Country', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=5, highlight_color='#3A5BA0', orientation_='h')

fig.set_size_inches(16,16)
plt.subplots_adjust(hspace = 0.25, wspace=0.25)

plt.show()

Country Wise Analysis¶

In [ ]:
def country_wise_analysis(country_):
  print('Director by Type of Show')
  df_sub = df[['show_id', 'country', 'type', 'director']].drop(df.loc[df['director']=='Unknown'].index).drop_duplicates()
  df_sub = df_sub.loc[df_sub['country']==country_]
  fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                                ['lower left', 'lower right']])
  ax1 = axd['upper']
  ax2 = axd['lower left']
  ax3 = axd['lower right']

  plt_dist_plot(ax_=ax1, df_=df_sub, var_='director', title_=f'Shows per Director Count ({country_})', xlabel_='Shows per Director', ylabel_='Density')
  plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='director', show_type='Movie', title_='Most Watched Directors', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
  plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='director', show_type='TV Show', title_='Most Watched Directors', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

  fig.set_size_inches(16,16)
  plt.subplots_adjust(hspace = 0.25, wspace=0.25)

  plt.show()

  # ------------------------------------------
  print('Cast by Type of Show')
  df_sub = df[['show_id', 'country', 'type', 'cast']].drop(df.loc[df['cast']=='Unknown'].index).drop_duplicates()
  df_sub = df_sub.loc[df_sub['country']==country_]
  fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                                ['lower left', 'lower right']])
  ax1 = axd['upper']
  ax2 = axd['lower left']
  ax3 = axd['lower right']

  plt_dist_plot(ax_=ax1, df_=df_sub, var_='cast', title_=f'Shows per Actor Count ({country_})', xlabel_='Shows per Actor', ylabel_='Density')
  plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='cast', show_type='Movie', title_='Most Watched Actors', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
  plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='cast', show_type='TV Show', title_='Most Watched Actors', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

  fig.set_size_inches(16,16)
  plt.subplots_adjust(hspace = 0.25, wspace=0.25)

  plt.show()

  # ------------------------------------------
  print('Actor & Director by Type of Show')
  df_sub = df[['show_id', 'country', 'type', 'director', 'cast']].drop(df.loc[(df['director']=='Unknown')|(df['cast']=='Unknown')].index).drop_duplicates()
  df_sub = df_sub.loc[df_sub['country']==country_]
  df_sub['actor_director'] = df['cast'] + ' & ' + df['director']
  fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                                ['lower left', 'lower right']])
  ax1 = axd['upper']
  ax2 = axd['lower left']
  ax3 = axd['lower right']

  plt_dist_plot(ax_=ax1, df_=df_sub, var_='actor_director', title_=f'Shows per Actor & Director Count ({country_})', xlabel_='Shows per Actor & Director', ylabel_='Density')
  plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='actor_director', show_type='Movie', title_='Most Watched Actor & Director', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
  plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='actor_director', show_type='TV Show', title_='Most Watched Actor & Director', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

  fig.set_size_inches(16,16)
  plt.subplots_adjust(hspace = 0.25, wspace=0.25)

  plt.show()

   # ------------------------------------------
  print('Genre by Type of Show')
  df_sub = df[['show_id', 'country', 'type', 'listed_in']].drop(df.loc[df['listed_in']=='Unknown'].index).drop_duplicates()
  df_sub = df_sub.loc[df_sub['country']==country_]
  fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                                ['lower left', 'lower right']])
  ax1 = axd['upper']
  ax2 = axd['lower left']
  ax3 = axd['lower right']

  plt_dist_plot(ax_=ax1, df_=df_sub, var_='listed_in', title_=f'Shows per Genre Count ({country_})', xlabel_='Shows per Genre', ylabel_='Density')
  plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='listed_in', show_type='Movie', title_='Most Watched Genre', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
  plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='listed_in', show_type='TV Show', title_='Most Watched Genre', independant_label_='', dependant_label_='', unit_='TV Shows', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

  fig.set_size_inches(16,16)
  plt.subplots_adjust(hspace = 0.25, wspace=0.25)

  plt.show()

  # ------------------------------------------
  print('Rating by Type of Show')
  df_sub = df[['show_id', 'country', 'type', 'rating']].drop_duplicates()
  df_sub = df_sub.loc[df_sub['country']==country_]
  fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                                ['lower left', 'lower right']])
  ax1 = axd['upper']
  ax2 = axd['lower left']
  ax3 = axd['lower right']

  plt_dist_plot(ax_=ax1, df_=df_sub, var_='rating', title_=f'Shows per Rating Count ({country_})', xlabel_='Shows per Rating', ylabel_='Density')
  plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='rating', show_type='Movie', title_='Most Watched Rating', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
  plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='rating', show_type='TV Show', title_='Most Watched Rating', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

  fig.set_size_inches(16,16)
  plt.subplots_adjust(hspace = 0.25, wspace=0.25)

  plt.show()

  # ------------------------------------------
  print('Duration by Type of Show')
  df_sub = df[['show_id', 'country', 'type', 'duration']].drop_duplicates()
  df_sub = df_sub.loc[df_sub['country']==country_]
  df_sub.loc[df['type']=='Movie', 'duration'] = np.ceil(df_sub.loc[df['type']=='Movie', 'duration']/10)*10
  fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                                ['lower left', 'lower right']])
  ax1 = axd['upper']
  ax2 = axd['lower left']
  ax3 = axd['lower right']

  plt_dist_plot(ax_=ax1, df_=df_sub, var_='duration', title_=f'Shows per Duration ({country_})', xlabel_='Shows per Duration', ylabel_='Density')
  plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='duration', show_type='Movie', title_='Most Shows Duration', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
  plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='duration', show_type='TV Show', title_='Most Shows Duration', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

  fig.set_size_inches(16,16)
  plt.subplots_adjust(hspace = 0.25, wspace=0.25)

  plt.show()
  
  # ------------------------------------------
  print('Date Added by Type of Show')
  df_sub = df[['show_id', 'country', 'type', 'date_added']].drop_duplicates()
  df_sub = df_sub.loc[df_sub['country']==country_]
  df_sub['date_added'] = df_sub['date_added'].dt.year
  fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                                ['lower left', 'lower right']])
  ax1 = axd['upper']
  ax2 = axd['lower left']
  ax3 = axd['lower right']

  plt_dist_plot(ax_=ax1, df_=df_sub, var_='date_added', title_=f'Shows per Date Added Count Yearly ({country_})', xlabel_='Shows per Date Added (Year) ', ylabel_='Density')
  plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='date_added', show_type='Movie', title_='Most Shows Date Added ', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
  plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='date_added', show_type='TV Show', title_='Most Shows Date Added ', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

  fig.set_size_inches(16,16)
  plt.subplots_adjust(hspace = 0.25, wspace=0.25)

  plt.show()

  # ------------------------------------------
  print('Release Year by Type of Show')
  df_sub = df[['show_id', 'country', 'type', 'release_year']].drop_duplicates()
  df_sub = df_sub.loc[df_sub['country']==country_]
  fig, axd = plt.subplot_mosaic([['upper', 'upper'],
                                ['lower left', 'lower right']])
  ax1 = axd['upper']
  ax2 = axd['lower left']
  ax3 = axd['lower right']

  plt_dist_plot(ax_=ax1, df_=df_sub, var_='release_year', title_=f'Shows per Release Year ({country_})', xlabel_='Shows per Release Year', ylabel_='Density')
  plt_most10_count_plot(ax_=ax2, df_=df_sub, var_='release_year', show_type='Movie', title_='Most Shows Release Year', independant_label_='', dependant_label_='', unit_='Movies', highlight_num=3, highlight_color='#DC3535', orientation_='h')
  plt_most10_count_plot(ax_=ax3, df_=df_sub, var_='release_year', show_type='TV Show', title_='Most Shows Release Year', independant_label_='', dependant_label_='', unit_='TV Show', highlight_num=3, highlight_color='#3A5BA0', orientation_='h')

  fig.set_size_inches(16,16)
  plt.subplots_adjust(hspace = 0.25, wspace=0.25)

  plt.show()

Country Wise Analysis: US¶

In [ ]:
country_wise_analysis('United States')
Director by Type of Show
Cast by Type of Show
Actor & Director by Type of Show
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show

Country Wise Analysis: India¶

In [ ]:
country_wise_analysis('India')
Director by Type of Show
Cast by Type of Show
Actor & Director by Type of Show
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show

Country Wise Analysis: UK¶

In [ ]:
country_wise_analysis('United Kingdom')
Director by Type of Show
Cast by Type of Show
Actor & Director by Type of Show
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show

Country Wise Analysis: Canada¶

In [ ]:
country_wise_analysis('Canada')
Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning.
  warnings.warn(msg, UserWarning)
Cast by Type of Show
Actor & Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning.
  warnings.warn(msg, UserWarning)
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show

Country Wise Analysis: France¶

In [ ]:
country_wise_analysis('France')
Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning.
  warnings.warn(msg, UserWarning)
Cast by Type of Show
Actor & Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning.
  warnings.warn(msg, UserWarning)
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show

Country Wise Analysis: Japan¶

In [ ]:
country_wise_analysis('Japan')
Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning.
  warnings.warn(msg, UserWarning)
Cast by Type of Show
Actor & Director by Type of Show
/usr/local/lib/python3.8/dist-packages/seaborn/distributions.py:316: UserWarning: Dataset has 0 variance; skipping density estimate. Pass `warn_singular=False` to disable this warning.
  warnings.warn(msg, UserWarning)
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show

Country Wise Analysis: South Korea¶

In [ ]:
country_wise_analysis('South Korea')
Director by Type of Show
Cast by Type of Show
Actor & Director by Type of Show
Genre by Type of Show
Rating by Type of Show
Duration by Type of Show
Date Added by Type of Show
Release Year by Type of Show

4.2 For categorical variable(s): Boxplot¶

In [ ]:
def plot_hist_box(var_, title_, xlabel_, movie_x_major, movie_x_minor, tv_x_major, tv_x_minor, is_date=False):  
  df_ = df[['show_id', 'type', var_]].drop_duplicates()
  if is_date:
    df_[var_] = df_[var_].dt.year
  # ----------------------------------------
  m1 = df_.loc[df['type']=='Movie', var_].mean()
  st1 = df_.loc[df['type']=='Movie', var_].std()

  q1 = df_.loc[df['type']=='Movie', var_].quantile(.25)
  q2 = df_.loc[df['type']=='Movie', var_].median()
  q3 = df_.loc[df['type']=='Movie', var_].quantile(.75)

  #
  ax1 = plt.subplot(2,2,1)
  sns.histplot(data=df_.loc[df['type']=='Movie'], x=var_, 
               binwidth=movie_x_minor, 
               color='#DC3535', ax=ax1, linewidth=2)
  df_mean = pd.DataFrame({'x': [m1, m1], 'y': ax1.get_ybound()})
  df_q1 = pd.DataFrame({'x': [q1, q1], 'y': ax1.get_ybound()})
  df_q2 = pd.DataFrame({'x': [q2, q2], 'y': ax1.get_ybound()})
  df_q3 = pd.DataFrame({'x': [q3, q3], 'y': ax1.get_ybound()})
  sns.lineplot(data=df_mean, x='x', y='y', color='red', ax=ax1, linestyle='--',
              estimator=None, linewidth = 2)
  sns.lineplot(data=df_q1, x='x', y='y', color='black', ax=ax1, linestyle='--',
              estimator=None, linewidth = 1)
  sns.lineplot(data=df_q2, x='x', y='y', color='cyan', ax=ax1, linestyle='--',
              estimator=None, linewidth = 2)
  sns.lineplot(data=df_q3, x='x', y='y', color='black', ax=ax1, linestyle='--',
              estimator=None, linewidth = 1)

  plt.sca(ax1)
  plt.title(f'{title_} for Movies', size=16, color='grey')
  plt.tick_params(
        axis='x',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        bottom=False,      # ticks along the bottom edge are off
        top=False,         # ticks along the top edge are off
        labelbottom=False) # labels along the bottom edge are off
  plt.xlabel('')
  plt.ylabel('Count', size=12)
  plt.yticks(size=12)
  sns.despine(bottom=True, left=False, trim=True, ax=ax1)
  ax1.spines['left'].set_color('grey')

  ax1.xaxis.set_major_locator(MultipleLocator(movie_x_major))
  ax1.xaxis.set_major_formatter('{x:.0f}')
  ax1.xaxis.set_minor_locator(MultipleLocator(movie_x_minor))

  #
  df_mean['y'] = [-0.3, 0.2]
  df_q1['y'] = [0.2, 0.25]
  df_q2['y'] = [0.1, 0.25]
  df_q3['y'] = [0.2, 0.25]
  ax2 = plt.subplot(2,2,3, sharex=ax1)
  sns.boxplot(data=df_.loc[df['type']=='Movie'], x=var_, ax=ax2,
                    color='#DC3535', showmeans=True,
                    # color='white',
                    flierprops={"marker": "x"}, medianprops={"color": "cyan"},
                    width=0.4, fliersize=1, linewidth=2, notch=True)
  sns.lineplot(data=df_mean, x='x', y='y', color='red', ax=ax2, linestyle='--',
              estimator=None, linewidth = 2)
  text = f' μ={m1:.1f}\n σ={st1:.1f}'
  ax2.annotate(text, xy=(m1, -0.3), rotation=90)

  sns.lineplot(data=df_q1, x='x', y='y', color='black', ax=ax2, linestyle='--',
              estimator=None, linewidth = 1)
  text = f'Q1={q1:.1f} '
  ax2.annotate(text, xy=(q1-0.1, 0.25), rotation=90, va='top', ha='right')
  sns.lineplot(data=df_q2, x='x', y='y', color='cyan', ax=ax2, linestyle='--',
              estimator=None, linewidth = 2)
  text = f'med={q2:.1f} '
  ax2.annotate(text, xy=(q2, 0.25), rotation=90, va='top', ha='center')
  sns.lineplot(data=df_q3, x='x', y='y', color='black', ax=ax2, linestyle='--',
              estimator=None, linewidth = 1)
  text = f'Q3={q3:.1f} '
  ax2.annotate(text, xy=(q3+0.1, 0.25), rotation=90, va='top', ha='left')

  plt.sca(ax2)
  plt.xlabel(xlabel_, size=12)
  plt.ylabel('')
  plt.xticks(size=12)
  sns.despine(bottom=False, left=False, ax=ax2)
  ax2.spines['bottom'].set_color('grey')
  ax2.spines['left'].set_color('grey')

  ax2.xaxis.set_major_locator(MultipleLocator(movie_x_major))
  ax2.xaxis.set_major_formatter('{x:.0f}')
  ax2.xaxis.set_minor_locator(MultipleLocator(movie_x_minor))

  # -------------------------------------------------
  m1 = df_.loc[df['type']=='TV Show', var_].mean()
  st1 = df_.loc[df['type']=='TV Show', var_].std()

  q1 = df_.loc[df['type']=='TV Show', var_].quantile(.25)
  q2 = df_.loc[df['type']=='TV Show', var_].median()
  q3 = df_.loc[df['type']=='TV Show', var_].quantile(.75)


  #
  ax3 = plt.subplot(2,2,2)
  sns.histplot(data=df_.loc[df['type']=='TV Show'], x=var_, ax=ax3,
                     binwidth=tv_x_minor, 
                     color='#3A5BA0', linewidth=2)
  
  df_mean = pd.DataFrame({'x': [m1, m1], 'y': ax3.get_ybound()})
  df_q1 = pd.DataFrame({'x': [q1, q1], 'y': ax3.get_ybound()})
  df_q2 = pd.DataFrame({'x': [q2, q2], 'y': ax3.get_ybound()})
  df_q3 = pd.DataFrame({'x': [q3, q3], 'y': ax3.get_ybound()})
  sns.lineplot(data=df_mean, x='x', y='y', color='red', ax=ax3, linestyle='--',
              estimator=None, linewidth = 2)
  sns.lineplot(data=df_q1, x='x', y='y', color='black', ax=ax3, linestyle='--',
              estimator=None, linewidth = 1)
  sns.lineplot(data=df_q2, x='x', y='y', color='cyan', ax=ax3, linestyle='--',
              estimator=None, linewidth = 2)
  sns.lineplot(data=df_q3, x='x', y='y', color='black', ax=ax3, linestyle='--',
              estimator=None, linewidth = 1)

  plt.sca(ax3)
  plt.title(f'{title_} for TV Show', size=16, color='grey')
  plt.tick_params(
        axis='x',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        bottom=False,      # ticks along the bottom edge are off
        top=False,         # ticks along the top edge are off
        labelbottom=False) # labels along the bottom edge are off
  plt.xlabel('')
  plt.ylabel('')
  plt.yticks(size=12)
  sns.despine(bottom=True, left=False, trim=True, ax=ax3)
  ax3.spines['left'].set_color('grey')

  ax3.xaxis.set_major_locator(MultipleLocator(tv_x_major))
  ax3.xaxis.set_major_formatter('{x:.0f}')
  ax3.xaxis.set_minor_locator(MultipleLocator(tv_x_minor))

  #
  df_mean['y'] = [-0.3, 0.2]
  df_q1['y'] = [0.2, 0.25]
  df_q2['y'] = [0.1, 0.25]
  df_q3['y'] = [0.2, 0.25]
  ax4 = plt.subplot(2,2,4, sharex=ax3)
  sns.boxplot(data=df_.loc[df['type']=='TV Show'], x=var_, 
                    color='#3A5BA0', showmeans=True, ax=ax4,
                    # color='white',
                    flierprops={"marker": "x"}, medianprops={"color": "cyan"},
                    width=0.4, fliersize=1, linewidth=2, notch=True)
  sns.lineplot(data=df_mean, x='x', y='y', color='red', ax=ax4, linestyle='--',
              estimator=None, linewidth = 2)
  text = f' μ={m1:.1f}\n σ={st1:.1f}'
  ax4.annotate(text, xy=(m1, -0.3), rotation=90)


  sns.lineplot(data=df_q1, x='x', y='y', color='black', ax=ax4, linestyle='--',
              estimator=None, linewidth = 1)
  text = f'Q1={q1:.1f} '
  ax4.annotate(text, xy=(q1-0.1, 0.25), rotation=90, va='top', ha='right')
  sns.lineplot(data=df_q2, x='x', y='y', color='cyan', ax=ax4, linestyle='--',
              estimator=None, linewidth = 2)
  text = f'med={q2:.1f} '
  ax4.annotate(text, xy=(q2, 0.25), rotation=90, va='top', ha='center')
  sns.lineplot(data=df_q3, x='x', y='y', color='black', ax=ax4, linestyle='--',
              estimator=None, linewidth = 1)
  text = f'Q3={q3:.1f} '
  ax4.annotate(text, xy=(q3+0.1, 0.25), rotation=90, va='top', ha='left')

  plt.sca(ax4)
  plt.xlabel(xlabel_, size=12)
  plt.ylabel('')
  plt.xticks(size=12)
  sns.despine(bottom=False, left=False, ax=ax4)
  ax4.spines['bottom'].set_color('grey')
  ax4.spines['left'].set_color('grey')

  ax4.xaxis.set_major_locator(MultipleLocator(tv_x_major))
  ax4.xaxis.set_major_formatter('{x:.0f}')
  ax4.xaxis.set_minor_locator(MultipleLocator(tv_x_minor))

  ax4.figure.set_size_inches(16,10)
  plt.subplots_adjust(top=0.95, right=0.86, hspace=0, wspace=0.1)
  plt.show()
In [ ]:
plot_hist_box(var_='duration', title_='Distribution of Duration', xlabel_='Duration', movie_x_major=100, movie_x_minor=20, tv_x_major=5, tv_x_minor=1)
In [ ]:
plot_hist_box(var_='release_year', title_='Distribution of Release Year', xlabel_='Release Year', movie_x_major=20, movie_x_minor=5, tv_x_major=20, tv_x_minor=5)
In [ ]:
plot_hist_box(var_='date_added', title_='Distribution of Date Added', xlabel_='Date Added', movie_x_major=5, movie_x_minor=1, tv_x_major=5, tv_x_minor=1, is_date=True)

4.3 For correlation: Heatmaps, Pairplots¶

In [ ]:
df_numeric = df[['show_id', 'type', 'duration', 'date_added', 'release_year']].drop_duplicates()
df_numeric['date_added'] = df_numeric['date_added'].dt.year
In [ ]:
sns.pairplot(data=df_numeric, hue='type', hue_order=['Movie', 'TV Show'], palette=['#DC3535', '#3A5BA0'])
plt.show()
In [ ]:
sns.heatmap(data=df_numeric.loc[df['type']=='Movie'].corr(), annot=True, cmap='Blues')
plt.show()
In [ ]:
sns.heatmap(data=df_numeric.loc[df['type']=='TV Show'].corr(), annot=True, cmap='Blues')
plt.show()

6. Insights based on Non-Graphical and Visual Analysis¶

Show_id:

  • All the values are unique

Type:

  • There are 70% Movies (6130) and 30% TV Shows (2676)

Title:

  • There is a duplicate title: 'Consequences'

Description:

  • Most of the shows have unique descriptions

Rating:

  • Movie
    • There are 8 unique categories of rating
    • Most (2862) shows have rating: 17+
  • TV Show
    • There are 7 unique categories of rating
    • Most (1147) shows have rating: 17+

Director:

  • Movie
    • There are 4777 directors
    • Rajiv Chilaka has directed the most (22) shows
  • TV Show
    • There are 229 directors
    • Alastair Fothergill has directed the most (3) shows

Cast:

  • Movie
    • There are 25951 actors
    • Anupam Kher has acted in the most (42) shows
  • TV Show
    • There are 14863 actors
    • Takahiro Sakurai has acted in the most (25) shows

Actor & Director:

  • Movie
    • There are 45952 actor & director duos
    • Julie Tejwani - Rajiv Chilaka & Rajesh Kava - Rajiv Chilaka has worked in the most (19) shows
  • TV Show
    • There are 2357 actor & director duos
    • David Attenborough - Alastair Fothergill has worked in the most (3) shows

Country:

  • Movie
    • Shows were produced in 118 countries
    • United States alone has produced the most (2.8 k) shows
  • TV Show
    • Shows were produced in 66 countries
    • United States alone has produced the most (938) shows

Genre:

  • Movie
    • There are 20 genres
    • Most of the shows fall under ‘Drama’ (2.4 k) and ‘Comedy’ (1.7 k) Genre
  • TV Show
    • There are 22 genres
    • Most of the shows fall under ‘Drama’ (763) and ‘Comedy’ (581) Genre

Duration:

  • Movie
    • Movies duration ranges from 3 mins to 312 mins
    • Data Distribution
      • Middle 50% of the data is of the movies having duration from 87 mins till 114 mins
      • Lower 25% of the data is of the movies having duration from 3 mins till 87 mins
      • Upper 25% of the data is of the movies having duration  from 87 mins till 312 mins
  • TV Show
    • TV show duration ranges from 1 season to 17 seasons
    • Data Distribution
      • Lower 50% of the data is of the TV shows having only 1 season
      • 25% of the data is of the TV shows having 2 seasons
      • 25% of the data is of the TV shows having 2 to 17 seasons

Release Year:

  • Movie
    • Data is from the year 1942 to 2021
    • Data Distribution
      • 50% of the data is of the movies that are released from 2016 till 2021 (~6 yrs)
      • 25% of the data is of the movies that are released from 2012 till 2016 (~4 yrs)
      • 25% of the data is of the movies that are released from 1942 till 2012 (~70 yrs)
  • TV Show
    • Data is from the year 1925 to 2021
    • Data Distribution
      • 50% of the data is of the TV shows that are released from 2018 till 2021 (~4 yrs)
      • 25% of the data is of the TV shows that are released from 2016 till 2018 (~2 yrs)
      • 25% of the data is of the TV shows that are released from 1925 till 2016 (~91 yrs)

Date Added:

  • Data is from the year 2008 to 2021
  • Data Distribution
    • 50% of the data is of the shows that are added from 2019 till 2021 (~2 yrs)
    • 25% of the data is of the shows that are added from 2018 till 2019 (~1 yrs)
    • 25% of the data is of the shows that are added from 2008 till 2018 (~10 yrs)
  • Most movies were added in year 2019
  • Most TV shows were added in year 2020

Contry Wise Insights:

US

  • Director
    • Most preferred movies are directed by Jay Karas and Marcus Raboy
    • Most preferred TV shows are directed by Ken Burns
  • Actor
    • Most preferred movies are of Samuel L. Jackson, Adam Sandler and James Franco
    • Most preferred TV shows are of Grey Griffin
  • Genre
    • Most preferred movies are of genre Drama, Comedy and Documentary
    • Most preferred TV shows are of genre Comedy, Drama and Kids’ TV
  • Rating
    • Most preferred movies are of 17+ category
    • Most preferred TV shows are of 17+ and 14+ category
  • Duration
    • Most preferred movies have duration between 80 mins to 110 mins
    • Most preferred TV shows have only 1 season

India

  • Director
    • Most preferred movies are directed by David Dhawan, Umesh Mehra and Anurag Kashyap
    • Most preferred TV shows directed by Gautham Vasudev Menon
  • Actor
    • Most preferred movies are of Anupam Kher, Shah Rukh Kahn and Naseeruddin Shah
    • Most preferred TV shows are of Nishka Raheja and Rajesh Kava
  • Genre
    • Most preferred movies are of genre Drama and Comedy
    • Most preferred TV shows are of genre Drama and Comedy
  • Rating
    • Most preferred movies are of 14+ and 17+ category
    • Most preferred TV shows are of 17+ and 14+ category
  • Duration
    • Most preferred movies have duration between 110 mins to 130 mins
    • Most preferred TV shows have only 1 season

UK

  • Director
    • Most preferred movies are directed by Edward Cotterill
    • Most preferred TV shows directed by Alastair Fothergill
  • Actor
    • Most preferred movies are of John Cleese
    • Most preferred TV shows are of David Attenborough
  • Genre
    • Most preferred movies are of genre Drama and Documentary
    • Most preferred TV shows are of genre British TV Shows and Docuseries
  • Rating
    • Most preferred movies are of 17+ category
    • Most preferred TV shows are of 17+ category
  • Duration
    • Most preferred movies have duration between 80 mins to 110 mins
    • Most preferred TV shows have only 1 season

Canada

  • Director
    • Most preferred movies are directed by Justin G. Dyck
  • Actor
    • Most preferred movies are of Robb Wells, John Paul Tremblay and John - Dunsworth
    • Most preferred TV shows are of Ashleigh Ball
  • Genre
    • Most preferred movies are of genre Comedy, Drama and ‘Children and Family - - Movies’
    • Most preferred TV shows are of genre Kids’ TV, Drama and Comedy
  • Rating
    • Most preferred movies are of 17+ category
    • Most preferred TV shows are of 17+ and 7+ category
  • Duration
    • Most preferred movies have duration between 80 mins to 100 mins
    • Most preferred TV shows have 1 and 2 season

France

  • Director
    • Most preferred movies are directed by Thierry Donald and Youssef Chahine
  • Actor
    • Most preferred movies are of Wille Lindberg
  • Genre
    • Most preferred movies are of genre Drama
    • Most preferred TV shows are of genre Kids’ TV
  • Rating
    • Most preferred movies are of 17+ category
    • Most preferred TV shows are of 17+ category
  • Duration
    • Most preferred movies have duration between 80 mins to 110 mins
    • Most preferred TV shows have only 1 season

Japan

  • Director
    • Most preferred movies are directed by Toshiya Shinohara
  • Actor
    • Most preferred movies are of Yuki Kaji
    • Most preferred TV shows are of Takahiro Sakurai
  • Genre
    • Most preferred movies are of genre Anime Features and Action & Adventure
    • Most preferred TV shows are of genre Anime Series
  • Rating
    • Most preferred movies are of 17+ and Parental Guidance category
    • Most preferred TV shows are of 14+ and 17+ category
  • Duration
    • Most preferred movies have duration between 90 mins to 100 mins
    • Most preferred TV shows have only 1 season

South Korea

  • Actor
    • Most preferred movies are of Kyeong-yeong Lee
    • Most preferred TV shows are of Cho Seong-ha, Kim Won-hae and Sung Dong-li
  • Genre
    • Most preferred movies are of genre Drama
    • Most preferred TV shows are of genre Korean TV Shows
  • Rating
    • Most preferred movies are of 17+ category
    • Most preferred TV shows are of 14+ category
  • Duration
    • Most preferred movies have duration between 100 mins to 140 mins
    • Most preferred TV shows have only 1 season

7. Business Insights¶

Generic Insights

  • There are 70% Movies and 30% TV Shows
  • Most preferred TV shows have only 1 season
  • Most preferred movies have duration between 80 mins to 110 mins
  • Most preferred Shows (Movie and TV Show) are of 17+ category
  • Most of the shows fall under ‘Drama’ and ‘Comedy’ Genre
  • Out of all the countries where it has presence, It has comparatively strong user base in some of the countries like (For movies) US, India, UK, Canada, France; (for TV Shows) US, UK, Japan, South Korea, Canada

Additional County Specific Insights

  • US
    • Genre
      • For Movies: Documentary genre is also preferred
      • For TV Shows: Kids’ TV genre is also preferred
  • India
    • Type
      • Movies are preferred over TV Shows
    • Duration
      • For Movies: Preferred duration is between 110 mins to 130 mins
    • Rating
      • Shows (Movies & TV Shows) of 14+ category are also preferred
  • UK
    • Genre
      • For Movies: Documentary genre is also preferred
      • For TV Shows: British TV Shows and Docuseries genres are also preferred
    • Actor
      • Most preferred TV shows are of David Attenborough
  • Canada
    • Genre
      • For Movies: Children and Family Movies genre is also preferred
      • For TV Shows: Kids’ TV genre is also preferred (7+ Category)
    • Duration
      • For TV Shows: Season 2 is also preferred
  • France
    • Genre
      • For TV Shows: Kids’ TV genre is also preferred
  • Japan
    • Type
      • TV Shows are preferred over Movies
    • Genre
      • For Movies: Anime Features and Action & Adventure genres are preferred over Comedy and Drama
      • For TV Shows: Anime Series genre is preferred over Comedy and Drama
    • Rating
      • For Movies: Parental Guidance category is also preferred
      • For TV Shows: 14+ category is also preferred
  • South Korea
    • Type
      • TV Shows are preferred over Movies
    • Genre
      • For TV Shows: Korean TV Shows are preferred over Comedy and Drama
    • Rating
      • For TV Shows: 14+ Category is also preferred

8. Recommendations¶

Business Recommendation

  • Make more movies compared to TV shows especially for India with an exception for some countries like Japan and South Korea where TV Shows are preferred over Movies.
  • Consider introducing new shows for countries like US, India, UK, Canada, France, Japan, South Korea since these countries already have strong user base.
  • Consider making movies having duration between 80 mins to 110 mins with an exception to India where movies can be of litter longer duration from 110 mins to 130 mins.
  • In case of TV Shows instead of making new season consider making new TV Show. In case of Canada season 2 can also be considered.
  • Consider shows (movie & TV shows) that fall under Drama and Comedy Genre. For countries like US and UK, Documentary movies can also be considered. British TV Shows and Docuseries (TV Shows) are also a good option for UK. Kids’ TV Shows can be considered for US, France and Canada. Canada is even a good market for Children and Family Movies. However, strategy to decide content for countries like Japan and South Korea has to be different. Consider Anime Features and Action & Adventure Movies and Anime Series TV shows for Japan. In case of South Korea, Korean TV Shows shall get preference.
  • More preference should be given to the shows (Movie and TV Show) that are of 17 years+ category. However, countries like India, Japan and South Korea can also accommodate 14 years+ category content.